Joint-Attention Learning in Prosody Transfer Speech Synthesis

Demo

Paper: Gao, Y., Raj, B., Singh, R. (2019) Joint-Attention Learning in Prosody Transfer Speech Synthesis

Text-to-speech (TTS) research aims to develop models that produce natural-sounding synthesized utterances from a piece of input text. State-of-the-art models such as Tacotron and DeepVoice3 have pushed the general naturalness of synthesized utterances and achieve excellent speech quality. In pursuit of more realistic speech synthesis, prosody-flexible TTS, also called expressive TTS, has recently become a topic of significant research.

In this work, we propose a prosody transfer text-to-speech synthesis model. A token table and its weights are learned together with the reference input to factorize the possible styles in an unsupervised manner. The results show that our model can successfully factorize the reference prosodies to represent the characteristics of different speakers and styles, learned from the training data without supervision.
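As a rough illustration of how a token table and its weights could combine into a style representation, the sketch below scores each token against a reference embedding, normalizes the scores with a softmax, and takes the weighted sum of the tokens. The scaled dot-product scoring, the function names, and the toy values are assumptions made for illustration; the page does not specify the exact attention mechanism used in the paper.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(reference_embedding, token_table):
    """Hypothetical helper: weight each token by its similarity to the reference.

    Assumption: scaled dot-product scoring. The paper's scoring may differ.
    """
    dim = len(reference_embedding)
    scores = [
        sum(r * t for r, t in zip(reference_embedding, token)) / math.sqrt(dim)
        for token in token_table
    ]
    weights = softmax(scores)
    # The style embedding is the weighted sum of the token vectors.
    style = [
        sum(w * token[d] for w, token in zip(weights, token_table))
        for d in range(dim)
    ]
    return weights, style


# Toy values, not learned parameters.
token_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
reference = [2.0, 0.0]
weights, style = style_embedding(reference, token_table)
print(weights, style)
```

In this toy case the first token is most similar to the reference, so it receives the largest weight.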

Proposed model

1. Token Representation

Token representations in our model are an approach to factorizing the prosodies of the training dataset. During the test phase, if we clip to a single specific token, the synthesis result represents that token's learned prosody.
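One way to picture clipping to a specific token is as replacing the token weights with a one-hot vector, so that the style representation reduces to that single token. The sketch below is a minimal illustration under that assumption. The helper names and the toy table are hypothetical, and the five rows only mirror the five samples listed per experiment on this page, not a confirmed table size.

```python
def clip_to_token(token_index, num_tokens):
    """Hypothetical helper: one-hot weights that select a single token."""
    if not 0 <= token_index < num_tokens:
        raise ValueError("token_index out of range")
    return [1.0 if i == token_index else 0.0 for i in range(num_tokens)]


def combine(weights, token_table):
    """Weighted sum of token vectors, giving the style representation."""
    dim = len(token_table[0])
    return [
        sum(w * token[d] for w, token in zip(weights, token_table))
        for d in range(dim)
    ]


# Toy token table with five tokens (illustrative values, not learned ones).
token_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

weights = clip_to_token(2, len(token_table))
style = combine(weights, token_table)
print(style == token_table[2])
```

With one-hot weights the combined style is exactly the selected row, which is why each clipped token yields its own distinct synthesis result.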

1.1 Assigned tokens give different speaker voices as synthesis results

To validate the effectiveness of our model, we train the proposed model on the VCTK dataset with the number of speakers set to one. The most significant variation in the dataset is then the speaker's voice characteristics, so the tokens are expected to learn different speakers' voices. In testing, the results are as expected: the utterances synthesized by clipping to specific tokens are voices of different people.

Utterance: "I’ve felt the chance that I have a number of options."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

1.2 Assigned tokens give different synthesis results

To further test the model, we train it on an internal dataset. The dataset contains only one speaker, so the tokens are expected to learn different prosody representations, prosody being the most significant style variation in the dataset.

Utterance: "Just recovered a fumble on ensuing kickoff."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

1.3 Token Representation on Blizzard 2013 dataset

As another example, we train the proposed model on the Blizzard 2013 dataset, a single-speaker dataset containing audiobook recordings. The books range from novels to the Bible and from fiction to narrative, so the prosodies in the dataset vary significantly, which is ideal for style learning. Here again, different tokens are expected to represent different prosodies.

Utterance: "Just recovered a fumble on ensuing kickoff."

The audio files are listed below:

Token1: 
Token2: 
Token3: 
Token4: 
Token5: 

2. Prosody Transfer

In this section, we show the results of prosody transfer from a reference utterance to a test utterance.

2.1 Parallel utterances

The following shows three examples of prosody transfer synthesis.

In each example, the text of the utterance to synthesize is the same as the reference's.

Example 1

Utterance text content: My mother always took him to the town on a market day in a light gig.

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Example 2

Utterance text content: So we never saw Dick any more.

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

Example 3

Utterance text content: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

Reference utterance:
Neutral prosody result:
Prosody Transfer result:

2.2 Unparallel utterance

The following shows three examples of unparallel prosody transfer synthesis.

In each example, the text of the utterance to synthesize is different from the reference's.

Example 1

The prosody of the unparallel reference utterance is transferred to synthesis results with different text content.

Reference utterance: 
Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 1: 
Text: So we never saw Dick any more.


Prosody Transfer text 2: 
Text: Just recovered a fumble on ensuing kickoff.


Example 2

Reference utterance:
Text: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?


Prosody Transfer text 1:
Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 2:
Text: There was nothing disagreeable in Mister Rushworth's appearance.


Example 3

Reference utterance:
Text: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.


Prosody Transfer text 1:
Text: Just recovered a fumble on ensuing kickoff.


Prosody Transfer text 2:
Text: My mother always took him to the town on a market day in a light gig.